```r
library(keras)
library(ranger)
library(stacks)
library(xgboost)
library(ggplot2)
```
Predicting ICU duration
Using the ICU cohort mimiciv_icu_cohort.rds you built in Homework 4, develop at least three machine learning approaches (logistic regression with enet regularization, random forest, boosting, SVM, MLP, etc.) plus a model stacking approach for predicting whether a patient's ICU stay will be longer than 2 days. You should use the los_long variable as the outcome. Your algorithms can use patient demographic information (gender, age at ICU intime, marital status, race), ICU admission information (first care unit), the last lab measurements before the ICU stay, and the first vital measurements during the ICU stay as features. You are welcome to use any feature engineering techniques you think are appropriate, but make sure not to use features that are not available at an ICU stay's intime. For instance, last_careunit cannot be used in your algorithms.
Answer:
Note: I also include saved .rds files (fitted ML models) to reduce render time.
1. Data preprocessing and feature engineering
First, I load the ICU cohort mimic_icu_cohort.rds and select the features I need for the machine learning algorithms.
```r
# define the variables I need
ctg_vars <- c("race", "insurance", "marital_status", "gender", "first_careunit")
lab_vars <- c("Creatinine", "Potassium", "Sodium", "Chloride", "Bicarbonate",
              "Hematocrit", "Glucose", "White_Blood_Cells")
vital_vars <- c("Heart_Rate", "Non_Invasive_Blood_Pressure_systolic",
                "Non_Invasive_Blood_Pressure_diastolic",
                "Temperature_Fahrenheit", "Respiratory_Rate")
all_vars <- c(ctg_vars, lab_vars, vital_vars,
              "los_long", "age_intime", "subject_id", "hadm_id", "stay_id")

# import data from rds file
mimic_icu_cohort <- read_rds("../hw4/mimiciv_shiny/mimic_icu_cohort.rds") |>
  rename_with(~ gsub(" ", "_", .x)) |>
  select(all_of(all_vars)) |>
  as_tibble()

# convert the categorical variables to factor
mimic_icu_cohort <- mimic_icu_cohort |>
  mutate(across(all_of(ctg_vars), factor))

# summary of the dataset (table)
mimic_icu_cohort |>
  select(-subject_id, -hadm_id, -stay_id) |>
  tbl_summary(by = los_long)
```
| Characteristic | FALSE, N = 38,050¹ | TRUE, N = 35,131¹ |
|---|---|---|
| race |  |  |
| ASIAN | 1,148 (3.0%) | 1,007 (2.9%) |
| BLACK | 4,311 (11%) | 3,649 (10%) |
| HISPANIC | 1,492 (3.9%) | 1,249 (3.6%) |
| Other | 5,160 (14%) | 5,596 (16%) |
| WHITE | 25,939 (68%) | 23,630 (67%) |
| insurance |  |  |
| Medicaid | 3,060 (8.0%) | 2,468 (7.0%) |
| Medicare | 16,489 (43%) | 16,602 (47%) |
| Other | 18,501 (49%) | 16,061 (46%) |
| marital_status |  |  |
| DIVORCED | 2,843 (8.0%) | 2,561 (7.9%) |
| MARRIED | 16,984 (48%) | 15,784 (49%) |
| SINGLE | 11,175 (31%) | 9,683 (30%) |
| WIDOWED | 4,699 (13%) | 4,339 (13%) |
| Unknown | 2,349 | 2,764 |
| gender |  |  |
| F | 17,014 (45%) | 15,349 (44%) |
| M | 21,036 (55%) | 19,782 (56%) |
| first_careunit |  |  |
| Cardiac Vascular Intensive Care Unit (CVICU) | 5,827 (15%) | 5,755 (16%) |
| Medical Intensive Care Unit (MICU) | 8,782 (23%) | 7,116 (20%) |
| Medical/Surgical Intensive Care Unit (MICU/SICU) | 7,147 (19%) | 5,586 (16%) |
| Surgical Intensive Care Unit (SICU) | 5,654 (15%) | 5,507 (16%) |
| Other | 10,640 (28%) | 11,167 (32%) |
| Creatinine | 1.00 (0.80, 1.40) | 1.00 (0.80, 1.60) |
| Unknown | 2,599 | 3,171 |
| Potassium | 4.20 (3.80, 4.60) | 4.20 (3.80, 4.70) |
| Unknown | 4,189 | 4,712 |
| Sodium | 139.0 (136.0, 141.0) | 138.0 (135.0, 141.0) |
| Unknown | 4,174 | 4,698 |
| Chloride | 102 (98, 105) | 102 (98, 105) |
| Unknown | 4,175 | 4,708 |
| Bicarbonate | 25.0 (22.0, 27.0) | 24.0 (21.0, 27.0) |
| Unknown | 4,270 | 4,780 |
| Hematocrit | 36 (30, 40) | 35 (29, 40) |
| Unknown | 2,236 | 2,781 |
| Glucose | 118 (98, 153) | 122 (100, 159) |
| Unknown | 4,282 | 4,817 |
| White_Blood_Cells | 9.0 (6.6, 12.6) | 9.7 (7.0, 13.8) |
| Unknown | 2,272 | 2,822 |
| Heart_Rate | 85 (74, 99) | 88 (76, 103) |
| Unknown | 17 | 1 |
| Non_Invasive_Blood_Pressure_systolic | 122 (107, 139) | 120 (104, 138) |
| Unknown | 732 | 247 |
| Non_Invasive_Blood_Pressure_diastolic | 68 (58, 80) | 66 (55, 79) |
| Unknown | 733 | 250 |
| Temperature_Fahrenheit | 98.10 (97.60, 98.60) | 98.20 (97.60, 98.80) |
| Unknown | 1,189 | 173 |
| Respiratory_Rate | 18.0 (15.0, 22.0) | 19.0 (15.0, 23.0) |
| Unknown | 87 | 11 |
| age_intime | 65 (53, 77) | 67 (56, 78) |

¹ n (%); Median (IQR)
Then, I check the missingness of the variables and visualize it below. The plot shows that missingness is not severe: every variable is missing fewer than 15% of its values. Therefore, I impute the missing values instead of removing variables.
```r
# display the missing proportion of each variable and visualize it
mimic_icu_cohort |>
  map_df(~ sum(is.na(.x)) / nrow(mimic_icu_cohort)) |>
  gather(variable, missing) |>
  filter(missing > 0) |>
  arrange(desc(missing)) |>
  ggplot(aes(x = reorder(variable, missing), y = missing)) +
  geom_col() +
  labs(title = "Missingness of Variables",
       x = "Variables",
       y = "Proportion of Missingness") +
  # display the value of each bar, rounded to 0.01%
  geom_text(aes(label = scales::percent(missing, accuracy = 0.01)),
            vjust = -1, size = 3) +
  coord_cartesian(ylim = c(0, 0.15)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(panel.background = element_rect(fill = "white"),
        panel.grid = element_line(color = "gray", linewidth = 0.2),
        panel.border = element_rect(fill = NA, linewidth = 0.5))
```
Before imputing the missing values, I convert outliers of the continuous variables to missing values.
```r
# write a function to convert outliers to missing values
outlier_to_na <- function(x) {
  q1 <- quantile(x, 0.25, na.rm = TRUE)
  q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  x[x < (q1 - 1.5 * iqr) | x > (q3 + 1.5 * iqr)] <- NA
  x
}

# convert the outliers (IQR method) to missing values for lab and vital events
# (all_of() avoids the tidyselect deprecation warning for external vectors)
mimic_icu_replace <- mimic_icu_cohort |>
  mutate(across(all_of(c(lab_vars, vital_vars)), outlier_to_na))
```
Here, I use miceRanger to impute the missing values. To save render time, I check if the imputed dataset is already saved. If not, I impute the missing values using miceRanger and save the imputed dataset as mimic_icu_mice.rds.
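The imputation chunk itself is not shown above; the following is a minimal sketch consistent with the description. The object name `mimic_icu_mice` and the file name mimic_icu_mice.rds come from the surrounding text, while the number of imputed datasets (`m = 5`) is an assumed setting, not the original one:

```r
# impute missing values with miceRanger, caching the result to disk
if (file.exists("mimic_icu_mice.rds")) {
  mimic_icu_mice <- read_rds("mimic_icu_mice.rds")
} else {
  mimic_icu_mice <- miceRanger(
    mimic_icu_replace,
    m = 5,           # number of imputed datasets (assumed)
    verbose = FALSE
  )
  write_rds(mimic_icu_mice, "mimic_icu_mice.rds")
}
```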
To check the imputed values, I plot the distributions of the imputed variables (black) and compare them with the original ones (red). The imputed distributions closely track the originals, which suggests the imputation is reasonable.
```r
# plot distributions of the imputed variables and compare with the original ones
plotDistributions(mimic_icu_mice, vars = 'allNumeric')
```
After imputing the missing values, I choose the first imputed dataset and convert non-numeric variables to factors and scale numeric variables.
```r
# choose the first imputed dataset
mimic_icu_imputed <- completeData(mimic_icu_mice)[[1]]

# convert non-numeric variables to factors and scale numeric variables
mimic_icu_final <- mimic_icu_imputed |>
  mutate_if(is.character, as.factor) |>
  mutate_if(is.numeric, scale) |>
  mutate_if(is.logical, as.factor)
```
2. Partition data into 50% training set and 50% test set
Stratify the partition by los_long. For grading purposes, sort the data by subject_id, hadm_id, and stay_id, and use seed 203 for the initial data split.
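The split code is not shown; a sketch using rsample, assuming the object names `data_split`, `train_set`, and `test_set` that appear in later chunks:

```r
# sort for reproducibility, then do a stratified 50/50 split
set.seed(203)
data_split <- mimic_icu_final |>
  arrange(subject_id, hadm_id, stay_id) |>
  initial_split(prop = 0.5, strata = los_long)
train_set <- training(data_split)
test_set  <- testing(data_split)
```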
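The recipe `icu_recipe` referenced by the workflows below is not shown; based on the two printed recipe steps (`step_dummy()`, `step_zv()`), a sketch:

```r
# preprocessing recipe: dummy-code factors, drop zero-variance predictors
icu_recipe <- recipe(los_long ~ .,
                     data = train_set |>
                       select(-subject_id, -hadm_id, -stay_id)) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_numeric_predictors())
```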
3. Train and tune the models using the training set
Here I will use the logistic regression with enet regularization, random forest, boosting, and model stacking to predict whether a patient’s ICU stay will be longer than 2 days.
First I set up the cross-validation folds to be shared by all models.
```r
# set cross-validation partitions
set.seed(203)
folds <- vfold_cv(train_set, v = 5)
folds
```
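3.1 Logistic regression with enet regularization

The elastic-net logistic regression setup is not shown above, although its results (`logit_res`, `final_logit`) are used later. A sketch with tidymodels and the glmnet engine; the tuning ranges and grid sizes are assumptions, not the original settings:

```r
# elastic-net penalized logistic regression
logit_mod <- logistic_reg(
    penalty = tune(),  # overall regularization strength
    mixture = tune()   # 0 = ridge, 1 = lasso
  ) |>
  set_engine("glmnet")

logit_wf <- workflow() |>
  add_recipe(icu_recipe) |>
  add_model(logit_mod)

# tune penalty and mixture over a regular grid (ranges are assumed)
logit_grid <- grid_regular(
  penalty(range = c(-6, -1)),
  mixture(),
  levels = c(20, 5)
)

# fit CV (cached to disk)
if (file.exists("logit_res.rds")) {
  logit_res <- read_rds("logit_res.rds")
} else {
  logit_res <- tune_grid(
    object = logit_wf,
    resamples = folds,
    grid = logit_grid,
    control = control_stack_grid()
  )
  write_rds(logit_res, "logit_res.rds")
}

# best model and finalized workflow
best_logit <- logit_res |>
  select_best("roc_auc")
final_logit <- logit_wf |>
  finalize_workflow(best_logit)
```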
3.2 Random forest

```r
# set up the random forest model
rf_mod <- rand_forest(
    mode = "classification",
    # number of predictors randomly sampled at each split
    mtry = tune(),
    # number of trees in the ensemble
    trees = tune()
  ) |>
  set_engine("ranger", importance = "impurity")
rf_mod
```
Random Forest Model Specification (classification)
Main Arguments:
mtry = tune()
trees = tune()
Engine-Specific Arguments:
importance = impurity
Computational engine: ranger
```r
# bundle the recipe and model into a workflow
rf_wf <- workflow() |>
  add_recipe(icu_recipe) |>
  add_model(rf_mod)
rf_wf
```
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_dummy()
• step_zv()
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Main Arguments:
mtry = tune()
trees = tune()
Engine-Specific Arguments:
importance = impurity
Computational engine: ranger
```r
# tune the number of trees and the number of features to use at each split
rf_grid <- grid_regular(
  trees(range = c(300L, 800L)),
  mtry(range = c(1L, 5L)),
  levels = c(10, 10)
)

# fit CV (cached to disk)
if (file.exists("rf_res.rds")) {
  rf_res <- read_rds("rf_res.rds")
} else {
  rf_res <- tune_grid(
    object = rf_wf,
    resamples = folds,
    grid = rf_grid,
    control = control_stack_grid()
  )
  write_rds(rf_res, "rf_res.rds")
}
rf_res
```
```r
# the best model
best_rf <- rf_res |>
  select_best("roc_auc")
best_rf
```
# A tibble: 1 × 3
mtry trees .config
<int> <int> <chr>
1 2 688 Preprocessor1_Model18
```r
# final workflow
final_rf <- rf_wf |>
  finalize_workflow(best_rf)
final_rf
```
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_dummy()
• step_zv()
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Main Arguments:
mtry = 2
trees = 688
Engine-Specific Arguments:
importance = impurity
Computational engine: ranger
3.3 Boosting
```r
# set up the boosting model
gb_mod <- boost_tree(
    mode = "classification",
    trees = 1000,
    tree_depth = tune(),
    learn_rate = tune()
  ) |>
  set_engine("xgboost")
gb_mod
```
Boosted Tree Model Specification (classification)
Main Arguments:
trees = 1000
tree_depth = tune()
learn_rate = tune()
Computational engine: xgboost
```r
# bundle the recipe and model into a workflow
gb_wf <- workflow() |>
  add_recipe(icu_recipe) |>
  add_model(gb_mod)
gb_wf
```
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_dummy()
• step_zv()
── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (classification)
Main Arguments:
trees = 1000
tree_depth = tune()
learn_rate = tune()
Computational engine: xgboost
```r
# tune the tree depth and learning rate
gb_grid <- grid_regular(
  tree_depth(range = c(1L, 5L)),
  learn_rate(range = c(-4, 1), trans = log10_trans()),
  levels = c(3, 10)
)

# fit CV (cached to disk)
if (file.exists("gb_res.rds")) {
  gb_res <- read_rds("gb_res.rds")
} else {
  gb_res <- tune_grid(
    object = gb_wf,
    resamples = folds,
    grid = gb_grid,
    control = control_stack_grid()
  )
  write_rds(gb_res, "gb_res.rds")
}
gb_res
```
```r
# the best model (best_gb was missing from the original chunk)
best_gb <- gb_res |>
  select_best("roc_auc")

# final workflow
final_gb <- gb_wf |>
  finalize_workflow(best_gb)
final_gb
```
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_dummy()
• step_zv()
── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (classification)
Main Arguments:
trees = 1000
tree_depth = 5
learn_rate = 0.00464158883361278
Computational engine: xgboost
3.4 Model stacking
```r
# build the stacked ensemble (cached to disk)
if (file.exists("icu_model_st.rds")) {
  icu_model_st <- read_rds("icu_model_st.rds")
} else {
  icu_model_st <-
    # initialize the stack
    stacks() |>
    # add candidate members
    add_candidates(logit_res) |>
    add_candidates(rf_res) |>
    add_candidates(gb_res) |>
    # determine how to combine their predictions
    blend_predictions(
      penalty = 10^(-6:2),
      metrics = c("roc_auc")
    ) |>
    # fit the candidates with nonzero stacking coefficients
    fit_members()
  write_rds(icu_model_st, "icu_model_st.rds")
}
icu_model_st
```
── A stacked ensemble model ─────────────────────────────────────
Out of 217 possible candidate members, the ensemble retained 22.
Penalty: 0.001.
Mixture: 1.
The 10 highest weighted member classes are:
```r
# show the relationship more directly
autoplot(icu_model_st, type = "members")
```
```r
# see the top results
autoplot(icu_model_st, type = "weights")
```
4. Compare model classification performance on the test set
Report both the area under ROC curve and accuracy for each machine learning algorithm and the model stacking. Interpret the results. What are the most important features in predicting long ICU stays? How do the models compare in terms of performance and interpretability?
4.1 AUC and accuracy of each model
4.1.1 Logistic regression with enet regularization
I fit the whole training set and predict the test cases based on logistic regression. Then I compute the test metrics as follows.
The AUC of the logistic regression model is 0.600. This means that there is a 60% chance that the model will be able to distinguish between a randomly chosen positive instance and a randomly chosen negative instance.
The accuracy is 0.573, which means that 57.3% of the logistic regression model’s predictions are correct.
```r
# fit the whole training set, then predict the test cases based on logistic regression
logit_fit <- final_logit |>
  last_fit(data_split)

# test metrics
logit_fit |>
  collect_metrics()
```
4.1.2 Random forest

I fit the whole training set and predict the test cases based on random forest. Then I compute the test metrics as follows.
The AUC of the random forest model is 0.620. This means that there is a 62% chance that the model will be able to distinguish between a randomly chosen positive instance and a randomly chosen negative instance.
The accuracy is 0.586, which means that 58.6% of the random forest model’s predictions are correct.
```r
# fit the whole training set, then predict the test cases based on RF
rf_fit <- final_rf |>
  last_fit(data_split)

# test metrics
rf_fit |>
  collect_metrics()
```
4.1.3 Boosting

I fit the whole training set and predict the test cases based on boosting. Then I compute the test metrics as follows.
The AUC of the boosting model is 0.617. This means that there is a 61.7% chance that the model will be able to distinguish between a randomly chosen positive instance and a randomly chosen negative instance.
The accuracy is 0.583, which means that 58.3% of the boosting model’s predictions are correct.
```r
# fit the whole training set, then predict the test cases based on GB
gb_fit <- final_gb |>
  last_fit(data_split)

# test metrics
gb_fit |>
  collect_metrics()
```
4.1.4 Model stacking

I predict the test cases based on model stacking, by probability and by class respectively. Then I compute the AUC and accuracy as follows.
The AUC of the stacking model is 0.622. This means that there is a 62.2% chance that the model will be able to distinguish between a randomly chosen positive instance and a randomly chosen negative instance.
The accuracy is 0.586, which means that 58.6% of the stacking model’s predictions are correct.
```r
# predict the test set by probability
icu_test_pred_st <- icu_model_st |>
  predict(test_set, type = "prob") |>
  bind_cols(test_set) |>
  print()
```
# A tibble: 36,591 × 22
.pred_FALSE .pred_TRUE race insurance marital_status gender first_careunit
<dbl> <dbl> <fct> <fct> <fct> <fct> <fct>
1 0.540 0.460 WHITE Medicaid WIDOWED F Medical Intensi…
2 0.521 0.479 WHITE Other MARRIED F Surgical Intens…
3 0.648 0.352 WHITE Other MARRIED F Surgical Intens…
4 0.552 0.448 WHITE Other MARRIED F Medical/Surgica…
5 0.608 0.392 Other Medicare SINGLE F Cardiac Vascula…
6 0.531 0.469 WHITE Other MARRIED F Other
7 0.341 0.659 WHITE Medicare WIDOWED F Medical Intensi…
8 0.445 0.555 WHITE Medicare WIDOWED F Medical Intensi…
9 0.504 0.496 WHITE Medicare WIDOWED M Other
10 0.366 0.634 Other Medicare MARRIED M Other
# ℹ 36,581 more rows
# ℹ 15 more variables: Creatinine <dbl[,1]>, Potassium <dbl[,1]>,
# Sodium <dbl[,1]>, Chloride <dbl[,1]>, Bicarbonate <dbl[,1]>,
# Hematocrit <dbl[,1]>, Glucose <dbl[,1]>, White_Blood_Cells <dbl[,1]>,
# Heart_Rate <dbl[,1]>, Non_Invasive_Blood_Pressure_systolic <dbl[,1]>,
# Non_Invasive_Blood_Pressure_diastolic <dbl[,1]>,
# Temperature_Fahrenheit <dbl[,1]>, Respiratory_Rate <dbl[,1]>, …
```r
# compute the AUC
st_auc <- roc_auc(icu_test_pred_st, truth = los_long, .pred_FALSE)
st_auc
```
```r
# predict the test set by class
icu_pred_class <- test_set |>
  select(los_long) |>
  bind_cols(
    predict(
      icu_model_st,
      test_set,
      type = "class",
      members = TRUE
    )
  ) |>
  print()
```
```r
# save the accuracy of the stacking model
# (the original chunk referenced an undefined `accuracy` object;
#  here the accuracy is computed directly with yardstick)
st_accuracy <- icu_pred_class |>
  accuracy(truth = los_long, estimate = .pred_class) |>
  pull(.estimate)

st_accu <- data.frame(
  .metric = "accuracy",
  .estimator = "binary",
  .estimate = st_accuracy
)
```
4.2 Feature importance
Here I plot the feature importance of the random forest and boosting models as follows. From the 2 plots, we can see that the most important features in predicting long ICU stays are age_intime, Heart_Rate, Non_Invasive_Blood_Pressure_systolic, White_Blood_Cells, Glucose, and Hematocrit. Among these features, age_intime is the patient’s age at ICU intime, Heart_Rate and Non_Invasive_Blood_Pressure_systolic are the first vital measurements during ICU stay, and White_Blood_Cells, Glucose and Hematocrit are the last lab measurements before the ICU stay.
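The plotting code for the importance figures is not shown; a sketch using the vip package on the fitted workflows (the use of `vip` here is an assumption, as is the number of features displayed):

```r
library(vip)

# variable importance from the fitted random forest workflow
rf_fit |>
  extract_fit_parsnip() |>
  vip(num_features = 10) +
  labs(title = "Random forest variable importance")

# variable importance from the fitted boosting workflow
gb_fit |>
  extract_fit_parsnip() |>
  vip(num_features = 10) +
  labs(title = "Boosting variable importance")
```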
4.3 Model comparison in terms of performance and interpretability
4.3.1 Performance (stacking > RF > XGB > logit)
From the performance table and ROC curves below, we can see that
Model Stacking shows the best performance in terms of both accuracy and ROC AUC, slightly outperforming the other models. This indicates that combining the predictions of multiple models can lead to a slight improvement in predictive performance.
Random Forest and Boosting models perform similarly and better than logistic regression, suggesting that tree-based methods may capture complex patterns in the data more effectively.
Logistic Regression shows the lowest performance among the models in both metrics, which might be due to its linear nature being less capable of handling complex relationships in the data.
```r
# compare the models in terms of performance
library(broom)
metrics_list <- list(
  logit = logit_fit %>% collect_metrics(),
  rf = rf_fit %>% collect_metrics(),
  boosting = gb_fit %>% collect_metrics(),
  stacking = st_accu,
  stacking = st_auc
)

# tabulate the accuracy and roc_auc results
bind_rows(metrics_list, .id = "model") |>
  filter(.metric == "accuracy" | .metric == "roc_auc") |>
  select(-.estimator, -.config) |>
  pivot_wider(names_from = .metric, values_from = .estimate) |>
  print()
```
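The ROC curves mentioned in the performance comparison are not shown in code; a sketch that overlays the test-set ROC curves of all four models (object names follow the chunks above; the stacked model's curve uses `icu_test_pred_st`):

```r
# collect test-set predictions from each model and draw ROC curves
roc_df <- bind_rows(
  logit_fit |> collect_predictions() |> mutate(model = "logit"),
  rf_fit |> collect_predictions() |> mutate(model = "rf"),
  gb_fit |> collect_predictions() |> mutate(model = "boosting"),
  icu_test_pred_st |> mutate(model = "stacking")
) |>
  group_by(model) |>
  roc_curve(truth = los_long, .pred_FALSE)

autoplot(roc_df)
```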
4.3.2 Interpretability

Logistic Regression with ENet Regularization: High. Logistic regression models provide coefficients for each feature, making it straightforward to understand the impact of each feature on the prediction. Elastic Net regularization, which combines L1 and L2 penalties, can further enhance interpretability by promoting sparsity and reducing the influence of less important features.
Random Forest: Moderate. While random forests offer insights into feature importance, indicating which features are most influential in making predictions, the ensemble nature of the model (comprising many decision trees) makes it harder to trace the exact decision path for specific predictions.
Boosting Models: Moderate to Low. Similar to random forests, boosting models can provide measures of feature importance. However, the sequential correction of errors in boosting adds complexity, making the exact reasoning behind predictions less transparent than simpler models.
Model Stacking: Low. Stacking involves combining the predictions from multiple models, which inherently reduces interpretability. Understanding how individual predictions contribute to the final decision can be challenging, as it depends on the interplay between different base models and possibly a meta-model.
4.3.3 Conclusions on model comparison
In summary, the choice of model should consider the trade-off between performance and interpretability. Logistic regression offers high interpretability but lower performance, while random forests and boosting models provide a good balance between the two. Model stacking, while showing a slight performance advantage, sacrifices interpretability due to its complex nature. Therefore, the choice of model should be based on the specific needs of the application, considering the importance of interpretability, the complexity of the data, and the desired level of predictive performance.
5. Conclusions
In this homework, I developed machine learning approaches to predict whether a patient's ICU stay will be longer than 2 days using the MIMIC-IV ICU cohort. First, I preprocessed the data by handling outliers, imputing missing values, and correcting data types. Then, I partitioned the data into a 50% training set and a 50% test set, and trained and tuned logistic regression with enet regularization, random forest, and boosting models, plus a model stacking approach. Finally, I compared the classification performance of the models on the test set, identified the most important features in predicting long ICU stays, and evaluated the models in terms of performance and interpretability. The results showed that model stacking slightly outperformed the other models in terms of accuracy and ROC AUC, while logistic regression provided the highest interpretability. age_intime, Heart_Rate, Non_Invasive_Blood_Pressure_systolic, White_Blood_Cells, Glucose, and Hematocrit were identified by the RF and boosting models as the most important features in predicting long ICU stays.